Battle royale games have surged in popularity in recent years. The premise of such games is as follows: players are dropped onto a fictional island and fight to be the last person standing. As they roam around the island, they loot for weapons and items crucial for their survival. Players can choose to join a game as a solo player or with a group of friends (4 players maximum). When playing solo, players are immediately eliminated when they are killed. However, in group play, killed individuals can be revived by their teammates.
We are interested in building a prediction model for the popular battle royale game PUBG (PlayerUnknown’s Battlegrounds). In PUBG, players not only have to worry about getting killed by other players, but they also have to stay within the shrinking “safe zone,” which effectively forces players into contact with each other. Outside of the “safe zone,” players take damage to their health at increasing rates.
Through our analysis, we aim to understand what characterizes winning players or teams: How aggressive are the playing styles of the winners? Is it better to land in a densely or sparsely populated area? Do players who travel farther on the map tend to place higher or lower? Answers to such questions will be of high interest for the PUBG gaming community.
The main goal of this project is to predict a solo player’s finish placement based on their in-game actions. Specifically, the three subquestions of interest are:
The data comes from the Kaggle competition “PUBG Finish Placement Prediction.”
# download_data.sh
library(tidyverse) # read_csv, dplyr verbs, ggplot2
library(janitor)   # clean_names
data.url <- "https://www.dropbox.com/s/mp89gp57cz2dsc7/train_V2.csv.zip?dl=1"
if (!dir.exists("./data")) dir.create("./data")
if(!file.exists("./data/train_V2.csv.zip")){
download.file(data.url, destfile = "./data/train_V2.csv.zip", mode = "wb")
}
# Warning: Large dataset (628 MB), will take a minute or so to read.
raw_dat <- read_csv("./data/train_V2.csv.zip")
clean_dat = raw_dat %>%
clean_names() %>%
drop_na(win_place_perc) # Drop rows without outcome variable
Each row in the data contains one player’s post-game stats. A description of all data fields is provided in data/pubg_codebook.csv. We will focus on the solo game mode (match_type is solo, solo-fpp, or normal-solo-fpp). The solo game mode constitutes about 16% of the data, with 720,386 observations. The outcome variable we are trying to predict is win_place_perc.
solo_dat <- clean_dat %>%
filter(match_type %in% c("solo", "solo-fpp", "normal-solo-fpp")) %>%
select(-dbn_os, -assists, -revives, -group_id, -match_type, -team_kills) %>% # Remove features that are not relevant to solo mode
mutate(kill_points = ifelse(rank_points == -1 & kill_points == 0, NA, kill_points), # Following codebook explanations
win_points = ifelse(rank_points == -1 & win_points == 0, NA, win_points),
rank_points = ifelse(rank_points == -1, NA, rank_points),
id = as.factor(id),
match_id = as.factor(match_id)) %>%
select(-kill_points, -win_points, -rank_points, -num_groups) # Remove deprecated or unreliable variables
Additionally, we remove matches that contain more observations than the maximum possible number of players (max_place), since a match cannot have more players than placements.
# Compute proportions
prop_data = solo_dat %>%
group_by(match_id, max_place, match_duration) %>%
count() %>%
ungroup() %>%
mutate(prop = n/max_place,
remove_game = prop > 1)
# Remove games with proportion greater than 100%
remove_match_ids = prop_data %>%
filter(remove_game) %>%
pull(match_id)
solo_dat = solo_dat %>%
filter(!(match_id %in% remove_match_ids))
We are given a training set and a test set. The outcome variable for the test set will not be provided until the end of the Kaggle competition on Jan. 30, 2019. Therefore, for the purposes of this project, we will only be using the provided training set. Due to computational costs, we will only use observations from 2,000 of the matches (23% of the available data). Within this set, we will create our own “training” (60%), “validation” (20%), and “test” (20%) sets.
Note that caret’s createDataPartition splits the samples into groups based on win_place_perc and then randomly samples within each subgroup with \(p = 0.6\); this way, the training, validation, and test sets have approximately similar proportions of each win percentile. We achieve a similar balance by sampling whole matches instead, since we have full player data for every match.
For the rest of the document, the “training set” refers to the one we’ve created.
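For reference, a direct stratified split with caret would look like the following sketch (not what we ultimately use, since sampling whole matches achieves a similar balance):

```r
library(caret)

set.seed(1)
# Stratified split: createDataPartition groups rows by quantiles of the
# outcome, then samples 60% within each group
train_idx <- createDataPartition(solo_dat$win_place_perc, p = 0.6, list = FALSE)
```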
# Sample 2000 matches
set.seed(1)
matches = sample(unique(solo_dat$match_id), 2000, replace = FALSE) # Sample from unique IDs so we get exactly 2000 matches
solo_dat = solo_dat %>% filter(match_id %in% matches)
# Split into train and test
train_ind = sample(c(TRUE, FALSE), size = length(unique(solo_dat$match_id)), replace = TRUE, prob = c(0.6, 0.4))
train_solo = solo_dat %>%
filter(match_id %in% unique(solo_dat$match_id)[train_ind])
temp_solo = solo_dat %>%
filter(!(match_id %in% unique(solo_dat$match_id)[train_ind]))
# Split into validation and test
test_ind = sample(c(TRUE, FALSE), size = length(unique(temp_solo$match_id)), replace = TRUE, prob = c(0.5, 0.5))
val_solo = temp_solo %>%
filter(match_id %in% unique(temp_solo$match_id)[test_ind])
test_solo = temp_solo %>%
filter(!(match_id %in% unique(temp_solo$match_id)[test_ind]))
glimpse(train_solo)
Observations: 99,040
Variables: 19
$ id <fct> 73348483a5974b, 677089d4226510, cfa11f07d8069...
$ match_id <fct> 85601fe44d519b, 06f853dab32a20, 4563c0409f911...
$ boosts <int> 0, 0, 3, 0, 1, 4, 6, 5, 0, 2, 0, 0, 1, 3, 0, ...
$ damage_dealt <dbl> 17.81, 46.41, 76.44, 19.35, 161.80, 216.20, 3...
$ headshot_kills <int> 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, 0, ...
$ heals <int> 0, 0, 2, 0, 0, 1, 2, 2, 0, 1, 0, 0, 6, 1, 0, ...
$ kill_place <int> 79, 81, 50, 96, 15, 12, 5, 2, 30, 19, 37, 86,...
$ kills <int> 0, 0, 0, 0, 2, 2, 4, 8, 1, 2, 1, 0, 2, 1, 0, ...
$ kill_streaks <int> 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, ...
$ longest_kill <dbl> 0.000, 0.000, 0.000, 0.000, 52.750, 79.950, 6...
$ match_duration <int> 1945, 1493, 1441, 1923, 1349, 1435, 1882, 138...
$ max_place <int> 99, 95, 95, 96, 94, 95, 97, 98, 91, 96, 89, 9...
$ ride_distance <dbl> 129.3, 0.0, 1054.0, 0.0, 0.0, 299.6, 0.0, 0.0...
$ road_kills <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ swim_distance <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0...
$ vehicle_destroys <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ walk_distance <dbl> 471.90, 50.82, 1542.00, 20.34, 536.90, 2363.0...
$ weapons_acquired <int> 3, 1, 2, 1, 6, 4, 3, 5, 4, 3, 1, 2, 5, 3, 2, ...
$ win_place_perc <dbl> 0.2245, 0.1489, 0.8830, 0.0000, 0.6667, 0.925...
In the training set, we have 99,040 players and 1,058 matches.
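These counts can be reproduced directly from the training set (a quick sketch using dplyr):

```r
# Sanity check of the training-set dimensions
nrow(train_solo)                # players (one row each)
n_distinct(train_solo$match_id) # distinct matches
```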
We first explored the distribution of each feature by the final finish percentile. Players were first grouped into the 0-19th, 20th-39th, 40th-59th, 60th-79th, or 80th-100th percentile finish. Then we plotted the density of features by percentile groups. Note that due to extreme outliers, we excluded the highest 1% of many of the features for clearer visualizations.
filter_vars = c("boosts", "damage_dealt", "headshot_kills", "heals", "kills", "longest_kill", "ride_distance", "swim_distance", "walk_distance", "weapons_acquired")
train_solo %>%
filter_at(vars(filter_vars), all_vars(. < quantile(., 0.99, na.rm = T))) %>% # Remove outliers
rename_at(vars(filter_vars), ~ paste0(., "*")) %>% # Mark outlier-trimmed features with an asterisk
mutate(win_place_cat = floor(win_place_perc / 0.2),
win_place_cat = ifelse(win_place_cat == 5, 4, win_place_cat),
win_place_cat = as.factor(win_place_cat)) %>%
gather("feature", "value", -match_id, -match_duration,
-id, -win_place_perc, -win_place_cat) %>%
ggplot(aes(x = value, group = win_place_cat, color = win_place_cat)) +
facet_wrap(feature ~., scales = "free") +
geom_density() +
labs(title = "Distribution of Features by Finish Percentile",
caption = "* Removed outliers (> 99th percentile) from this feature's density plot",
x = "Value of Features", y = "Density", color = "Percentile") +
scale_color_manual(labels = c("0-19", "20-39", "40-59", "60-79", "80-100"),
values = brewer.pal(5, "OrRd")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Some interesting relationships between the features and the finish percentile:
Use of Items (boosts, heals, and weapons_acquired): Players who finish higher tend to have used more boosts and healing items, and to have acquired more weapons. This is expected, since they stay in the game longer and have more time to collect and use items. However, it would be interesting to explore which of these features is most predictive of a high finish placement.
Kills & Damage (damage_dealt, kill_place, and kills): Players who finish higher tend to have more kills. They also tend to have dealt more damage. However, in the top finishing group, there is a wide variety in how much damage they inflict. This could potentially indicate strategies that differ in their level of aggressiveness during the course of the game but are similarly successful in achieving a high placement.
Distance Traveled (walk_distance, swim_distance, and ride_distance): Players who finish higher tend to have walked farther. This is likely because they simply survive longer and are forced to travel to stay in the safe zone, whereas players who die early don’t get a chance to travel very far. Both swimming and riding in vehicles are rare, though players who finish higher also tend to do more of both.
Statistics related to kills appear to be well correlated with finish percentile. Additionally, match duration does not seem strongly correlated with most in-game features, with the exception of walk_distance.
corr_matrix = train_solo %>%
select(-id, -match_id) %>%
cor()
corrplot(corr_matrix, method = "color", type = "upper", title = "Correlation Plot of Features")
To identify the features that contribute most to the variability within the covariates, we can use principal components analysis (PCA).
set.seed(1)
# PCA
X = train_solo %>%
select(-win_place_perc, -id, -match_id) %>%
as.matrix()
pc = prcomp(X, center = T, scale = T)
# Loadings for the first three principal components
pc$rotation[, 1:3] %>% kable()
|  | PC1 | PC2 | PC3 |
|---|---|---|---|
| boosts | 0.3147751 | -0.1622042 | 0.1686956 |
| damage_dealt | 0.3672756 | 0.2164339 | -0.1067439 |
| headshot_kills | 0.2867165 | 0.2434542 | -0.1197904 |
| heals | 0.2100020 | -0.2544299 | 0.2051843 |
| kill_place | -0.3696846 | -0.0397688 | -0.0564925 |
| kills | 0.3726581 | 0.2387193 | -0.1104184 |
| kill_streaks | 0.3204852 | 0.2265997 | -0.0981452 |
| longest_kill | 0.3042571 | 0.1016077 | -0.1001660 |
| match_duration | 0.0336809 | -0.4802400 | -0.3069652 |
| max_place | -0.0110764 | 0.0205827 | -0.0705728 |
| ride_distance | 0.1315557 | -0.4846559 | -0.2985481 |
| road_kills | 0.0455555 | -0.1201550 | -0.5130499 |
| swim_distance | 0.0560272 | -0.1532687 | 0.4969032 |
| vehicle_destroys | 0.0411344 | -0.1223628 | -0.2759644 |
| walk_distance | 0.2924897 | -0.2906486 | 0.2586336 |
| weapons_acquired | 0.2405694 | -0.2815642 | 0.1647059 |
# Plot eigenvalues (percentage of variances explained by each principal component)
fviz_eig(pc)
# Plot cumulative variance explained by the first k PCs
data.frame(cumvar = cumsum(pc$sdev^2/sum(pc$sdev^2))) %>%
ggplot(aes(x = 1:length(cumvar), y = cumvar*100)) +
geom_point() +
geom_line() +
labs(title = "Cumulative Variance Explained by the First K PCs",
x = "K", y = "Percent of Variance Explained") +
scale_y_continuous(limits = c(0, 100)) +
theme_bw()
# Plot features in terms of the first two principal components
# Positively correlated features point in the same direction; negatively correlated features point in opposite directions
fviz_pca_var(pc,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
The first principal component explains much of the variance in our features (35.6%), while the second through sixth components each explain between 5 and 10%. The first three components cover over 50% of the variation in our data, as seen in the second plot. In the third plot, we visualize each feature’s contribution to the first two principal components and their correlations.
PC 1: The first principal component comprises features related to damage and kills, primarily kills, kill_place, and damage_dealt. To a lesser extent, kill_streaks, longest_kill, headshot_kills, boosts, and heals also contribute mainly to this component.
PC 2: The second principal component is characterized by details of the match setting, i.e., match_duration and ride_distance (which we noted earlier was correlated with match duration).
PC 3: We can examine the table of contributions to each principal component to deduce which features contribute the most to the third principal component. The third principal component explains variation in features not well-represented in the first two components, such as road_kills and swim_distance.
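As a quick check, we can list the loadings with the largest magnitude on the third principal component directly from the prcomp object:

```r
# Features ranked by absolute loading on PC3
head(sort(abs(pc$rotation[, 3]), decreasing = TRUE), 5)
```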
# Plot points on principal components 1 and 2
data.frame(PC1 = pc$x[,1], PC2 = pc$x[,2]) %>%
ggplot(aes(PC1, PC2, color = train_solo$win_place_perc)) +
geom_point(alpha = 0.5) +
labs(title = "Win Place Percent of Training Data by PC1 and PC2",
color = "Win Place Percent") +
theme_bw()
# Plot points on principal components 1 and 3
data.frame(PC1 = pc$x[,1], PC3 = pc$x[,3]) %>%
ggplot(aes(PC1, PC3, color = train_solo$win_place_perc)) +
geom_point(alpha = 0.5) +
labs(title = "Win Place Percent of Training Data by PC1 and PC3",
color = "Win Place Percent") +
theme_bw()
# Plot points on principal components 2 and 3
data.frame(PC2 = pc$x[,2], PC3 = pc$x[,3]) %>%
ggplot(aes(PC2, PC3, color = train_solo$win_place_perc)) +
geom_point(alpha = 0.5) +
labs(title = "Win Place Percent of Training Data by PC2 and PC3",
color = "Win Place Percent") +
theme_bw()
Our outcome variable win_place_perc is reasonably well explained by the first two principal components, as lower PC1 and higher PC2 values correspond to a higher win place percentage. Similarly, we see correlation between the win place percentage and the second and third principal components. This suggests that the largest variations in the features can also explain variation in the outcome. We will explore this further in the feature ranking section.
Since the probability of winning increases as the number of remaining players decreases, it is to a player’s advantage to eliminate other players rather than merely damage them. Players thus tend to take one of two approaches to the game:
Aggressive Strategy: players (usually of higher skill) drop in a populated location and attempt to take control of the location. This is a high-risk and high-reward strategy as players are more likely to die due to an increased number of encounters but are rewarded with superior weapons and boosts.
Passive Strategy: rather than eliminating other players yourself, the passive strategy relies on surviving until other players have been eliminated. This typically involves hiding to minimize encounters until only a few players remain, then selectively taking fights until the player is the only one remaining.
To explore whether we could identify strategies to win, we looked at the distribution of the percentage of total players killed by game winners.
train_solo %>% filter(win_place_perc == 1) %>%
mutate(prop_killed = kills/max_place) %>%
ggplot(aes(x = prop_killed)) +
geom_histogram(aes(y = ..density..), color = 'white', binwidth = 0.01) +
geom_density(color = 'blue') +
labs(title = "Distribution in Proportion of Players Killed for Match Winners",
x = "Proportion of Players Killed", y = "Density") +
theme_bw()
If winners used a variety of aggression strategies, we would expect a multi-modal distribution in the proportion of players killed by match winners. However, the histogram reveals that winners eliminate, on average, about 7% of the players in the match. There is a small mass of winners with very few kills (who likely eliminated only the second-place player), suggesting that a purely passive approach tends to be less successful.
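The average behind this figure can be computed directly (a sketch assuming train_solo as above):

```r
# Mean and median proportion of players killed by match winners
train_solo %>%
  filter(win_place_perc == 1) %>%
  summarize(mean_prop   = mean(kills / max_place),
            median_prop = median(kills / max_place))
```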
Additionally, after further consideration, we realized that the data does not seem well suited to address the question of which strategies lead to success, for the following reasons:
We characterize player strategy by the proportion of total players killed. However, players who are more skilled (i.e., more accurate in their shots, better at positioning, etc.) tend to favor an aggressive strategy, while players less confident in their mechanical skill may favor a passive one. Thus, rather than measuring the effectiveness of a strategy, we would largely be measuring the mechanical skill of a player.
Aggressive players would likely be killing throughout the whole game, while passive players would have most of their kills concentrated at the end. Longitudinal data on when a player killed another player, or data on where a player dropped on the map, may therefore be better suited to understanding the effectiveness of player strategies.
Thus, we did not use k-means clustering to explore whether there were clusters of individuals who used similar strategies as k-means clustering always results in \(k\) clusters, whether those clusters truly exist or not.
In this section, we aim to answer our remaining two questions:
To answer these questions, we will fit the following models, evaluate them on an independent test set, and examine feature importance scores:
In training our models, unless stated otherwise, we include all features except the match ID and player ID, since neither generalizes to predicting new games, which is what we are interested in.
First, we fit a linear regression model with 5-fold cross-validation. Since the win place percentage is a value between 0 and 1, we apply a log transformation (adding 1 to ensure all values are defined), so that it better fulfills the assumption that \(y\) is a continuous variable on the real line. However, this still doesn’t guarantee that the predicted values will be between 0 and 1, so we constrain the predicted value to be 0 if it is negative and 1 if it is above 1.
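The clamping step can be written compactly as a small helper (a sketch equivalent to the case_when used in the code below):

```r
# Constrain predictions to the [0, 1] interval
clamp01 <- function(x) pmin(pmax(x, 0), 1)
clamp01(c(-0.2, 0.5, 1.3))  # returns 0.0 0.5 1.0
```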
set.seed(1)
tc = trainControl(method = "cv", number = 5)
# Fit linear model with 5-fold cross-validation
lm_model = train(log(win_place_perc + 1) ~ . - match_id - id, data = train_solo,
method = "lm", trControl = tc)
print(lm_model)
Linear Regression
99040 samples
18 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 79232, 79232, 79232, 79231, 79233
Resampling results:
RMSE Rsquared MAE
0.0685495 0.8872068 0.0495968
Tuning parameter 'intercept' was held constant at a value of TRUE
# Save predictions
predictions_solo = data.frame(win_place_perc = val_solo$win_place_perc,
lm_pred = exp(predict(lm_model, val_solo)) - 1) %>%
mutate(lm_pred = case_when(lm_pred < 0 ~ 0,
lm_pred > 1 ~ 1,
lm_pred >= 0 & lm_pred <= 1 ~ lm_pred)) # Constrain values to [0,1]
# Plot predicted vs true values
predictions_solo %>%
ggplot(aes(x = win_place_perc, y = lm_pred)) +
geom_point(alpha = 0.5, size = 0.1) +
labs(title = "Predicted Versus Actual Win Place Percentage (Linear Regression)",
x = "Actual Win Place Percentage", y = "Predicted Win Place Percentage") +
theme_bw()
As we noted earlier in the data exploration, some of the features are highly correlated and/or seem to provide little signal (such as distance traveled). To address these issues, we implement elastic net regression, which combines a ridge-like penalty that adjusts for correlated features with a lasso penalty that shrinks non-informative coefficients to zero. Using 5-fold cross-validation on the training set, we tune the regularization hyperparameters.
set.seed(1)
# Fit elastic net model (named lasso_model) using glmnet
lasso_model = train(log(win_place_perc + 1) ~ . - match_id - id, data = train_solo,
method = "glmnet", trControl = tc)
print(lasso_model)
glmnet
99040 samples
18 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 79232, 79232, 79232, 79231, 79233
Resampling results across tuning parameters:
alpha lambda RMSE Rsquared MAE
0.10 0.0003230778 0.06856201 0.8871773 0.04988400
0.10 0.0032307776 0.06913346 0.8856577 0.05166476
0.10 0.0323077761 0.08220437 0.8464587 0.06555230
0.55 0.0003230778 0.06856463 0.8871703 0.04985993
0.55 0.0032307776 0.06962549 0.8844075 0.05234333
0.55 0.0323077761 0.09562115 0.7970951 0.07656673
1.00 0.0003230778 0.06858067 0.8871194 0.04986878
1.00 0.0032307776 0.07029232 0.8826508 0.05319175
1.00 0.0323077761 0.09996480 0.7933901 0.08073803
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0.1 and lambda
= 0.0003230778.
#coef(lasso_model$finalModel, lasso_model$bestTune$lambda)
# Add predictions to df
predictions_solo = predictions_solo %>%
mutate(lasso_pred = exp(predict(lasso_model, val_solo)) - 1,
lasso_pred = case_when(lasso_pred < 0 ~ 0,
lasso_pred > 1 ~ 1,
lasso_pred >=0 & lasso_pred <= 1 ~ lasso_pred)) # Constrain values to [0,1]
# Plot predicted vs true values
predictions_solo %>%
ggplot(aes(x = win_place_perc, y = lasso_pred)) +
geom_point(alpha = 0.5, size = 0.1) +
labs(title = "Predicted Versus Actual Win Place Percentage (Lasso Regression)",
x = "Actual Win Place Percentage", y = "Predicted Win Place Percentage") +
theme_bw()
To account for interactions between features, we can use a random forest model. Ensemble methods like random forest also often outperform regression models on tabular data like ours. Due to computational costs, however, we make the following choices:
Again with 5-fold cross-validation, we tune the random forest hyperparameters.
set.seed(1)
# Warning: Takes a while to run (~3 minutes)
imp_features = varImp(lasso_model$finalModel) %>%
mutate(variable = rownames(.)) %>%
arrange(desc(Overall)) %>%
#slice(1:10) %>%
pull(variable)
rf_model = train(as.formula(paste("win_place_perc ~ ", paste(imp_features, collapse = "+"))),
data = sample_n(train_solo, 10000), method = "rf", ntree = 100, importance = T, trControl = tc)
print(rf_model)
Random Forest
10000 samples
16 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 8000, 8000, 8001, 8000, 7999
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 0.08672402 0.9223775 0.06691531
9 0.06857612 0.9457451 0.04739896
16 0.07026248 0.9430108 0.04814976
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 9.
# Add predictions to df
predictions_solo = predictions_solo %>%
mutate(rf_pred = predict(rf_model, val_solo))
# Plot predicted vs true values
predictions_solo %>%
ggplot(aes(x = win_place_perc, y = rf_pred)) +
geom_point(alpha = 0.5, size = 0.1) +
labs(title = "Predicted Versus Actual Win Place Percentage (Random Forest)",
x = "Actual Win Place Percentage", y = "Predicted Win Place Percentage") +
theme_bw()
Note that the spread of points narrows for players who place higher (actual win place percentage approaching 1). This indicates that we are able to predict more accurately for players who place higher.
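One way to quantify this narrowing is to compute the validation MAE within bins of the actual placement (a sketch assuming the predictions_solo data frame above):

```r
# Random forest MAE by quintile of actual win place percentage
predictions_solo %>%
  mutate(placement_bin = cut(win_place_perc, breaks = seq(0, 1, by = 0.2),
                             include.lowest = TRUE)) %>%
  group_by(placement_bin) %>%
  summarize(rf_mae = mean(abs(rf_pred - win_place_perc)))
```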
Which features are most predictive? With the generic varImp function, we can compare the relative importance of features across models. Note that varImp measures something different for each model: the absolute value of the t-statistic for linear regression, the absolute value of the coefficients for elastic net, and the random forest’s internal importance scores.
# Table of variable importance scores
lm_imp = varImp(lm_model$finalModel) %>% # absolute value of t-statistic
mutate(variable = rownames(.)) %>%
arrange(desc(Overall)) %>%
rename(lm_imp = Overall)
lasso_imp = varImp(lasso_model$finalModel) %>% # absolute value of coefficients
mutate(variable = rownames(.)) %>%
arrange(desc(Overall)) %>%
rename(lasso_imp = Overall)
rf_imp = varImp(rf_model$finalModel) %>% # importance scores
mutate(variable = rownames(.)) %>%
arrange(desc(Overall)) %>%
rename(rf_imp = Overall)
imp = full_join(lm_imp, lasso_imp, by = "variable") %>%
full_join(., rf_imp, by = "variable") %>%
select(variable, everything()) %>%
mutate_at(vars(lm_imp:rf_imp), function(x){scale(x, center = F)}) # Normalize importance scores within models
# Note: releveling references the whole column, so we do it outside of mutate:
imp$variable = factor(imp$variable, levels = imp$variable[order(imp$rf_imp, decreasing = T)])
imp = imp %>%
gather(model, importance, -variable)
# Make a plot for variable importance
imp %>%
ggplot(aes(x = variable, y = importance, fill = model)) +
geom_col(position = "dodge") +
labs(title = "Feature Ranking",
x = "Feature",
y = "Importance (Scaled)",
fill = "Model") +
scale_fill_manual(labels = c("Elastic Net", "Linear", "Random Forest"),
values = c("#999999", "#E69F00", "#56B4E9")) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
There is not much agreement in the features that each model regards as important, likely due to highly correlated features. For example, kill_place is rated as highly important by the linear regression and the random forest, while kill_streaks is rated as highly important by the elastic net regression; the two features are highly correlated with each other.
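This collinearity can be checked directly on the training set:

```r
# Correlation between the two kill-related features singled out above
cor(train_solo$kill_place, train_solo$kill_streaks)
```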
Let’s take a look at the importance score for random forest. The top features are … ** To fill in **
To summarize:
The most important predictor is the number of kills a player makes. The strongest signal lies in the relative number of kills, as indicated by kill_place, rather than the absolute number of kills or other aggressiveness indicators like weapons_acquired, kill_streaks, and headshot_kills. What this suggests is that while a few players may succeed with less confrontational playing strategies, the vast majority of top players pursue highly aggressive tactics. In the context of battle royale games, this means that you need to kill other players or you will be eliminated.
Among items used, boosts and weapons_acquired are the strongest predictors of the outcome. This is interesting because one would expect weapons to be the dominant predictor among items. However, the acquisition of weapons may plateau over time once you have strong weapons. On the other hand, boosts enable increased health regeneration over time and grant a small movement speed bonus once the amount of boost consumed passes a particular threshold (decaying over time). Players tend to save boosts as an additional advantage in the later stages of the game, when most players have powerful weapons. Consequently, the number of boosts consumed may be a stronger signal than heals.
We will compare our models with the following metrics on the validation set:
Mean absolute error (MAE): Represents the average absolute deviation. \[MAE = \frac{\sum_{i=1}^n |\hat{y}_i - y_i|}{n}\]
Self-defined accuracy metric (SDAM(x)): This metric is a function of a cutoff value \(x\). If the predicted outcome is within \(x\%\) of the actual win place percentage, we classify it as a “correct” prediction. Otherwise, it is an incorrect prediction.
\[SDAM(x) = \frac{\sum_{i=1}^n \mathbb{1}\{|\hat{y}_i - y_i| \le x\}}{n}\]
Classification of Winners: We can compute an ROC curve by turning our predictions into a classification problem. We classify a player as a winner if the predicted win place percentage is greater than a cutoff value \(x\). For different cutoff values, we can then compute the sensitivity (true positive rate: the proportion of actual winners we classify as such) and specificity (true negative rate: the proportion of actual losers we classify as such).
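For reference, SDAM can be written as a small helper function (a sketch; the report computes it inline below):

```r
# Proportion of predictions within x of the actual value
sdam <- function(pred, actual, x) mean(abs(pred - actual) <= x)
sdam(c(0.50, 0.90), c(0.52, 0.70), 0.05)  # returns 0.5
```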
predictions_solo = predictions_solo %>%
mutate(lm_ae = abs(lm_pred - win_place_perc),
lasso_ae = abs(lasso_pred - win_place_perc),
rf_ae = abs(rf_pred - win_place_perc))
# Summarize metrics
metrics_solo = predictions_solo %>%
mutate(lm_correct = lm_ae <= 0.05,
lasso_correct = lasso_ae <= 0.05,
rf_correct = rf_ae <= 0.05) %>%
summarize(lm_mae = mean(lm_ae),
lm_sdam = mean(lm_correct),
lasso_mae = mean(lasso_ae),
lasso_sdam = mean(lasso_correct),
rf_mae = mean(rf_ae),
rf_sdam = mean(rf_correct)) %>%
gather() %>%
separate(key, c("model", "metric"))
# Plot MAE
metrics_solo %>%
filter(metric == "mae") %>%
ggplot(aes(x = model, y = value, label = round(value, 3))) +
geom_col() +
geom_label() +
labs(title = "Comparison of MAE", x = "Model", y = "MAE") +
scale_x_discrete(labels = c("Elastic Net", "Linear", "Random Forest")) +
theme_bw()
# Plot SDAM
cutoff = seq(0, 0.5, by = 0.01)
prop_lm = sapply(cutoff, function(x) mean(predictions_solo$lm_ae < x))
prop_lasso = sapply(cutoff, function(x) mean(predictions_solo$lasso_ae < x))
prop_rf = sapply(cutoff, function(x) mean(predictions_solo$rf_ae < x))
data.frame(cutoff = rep(cutoff, 3),
prop = c(prop_lm, prop_lasso, prop_rf),
model = rep(c("Linear", "Elastic Net", "Random Forest"), each = length(cutoff))) %>%
ggplot(aes(x = cutoff, y = prop, color = model)) +
geom_point() +
geom_line() +
geom_vline(xintercept = 0.05, linetype = "longdash", color = "slategray") +
annotate(geom = "text", x = 0.05, label="SDAM(5)", y = 0.8, colour = "slategray", angle = 90, vjust = -1.1) +
labs(title = "Comparison of SDAM",
subtitle = "Proportion of Predictions Within x% of Actual Win Place Percentage",
x = "|Predicted - Actual Win Place Percentage|",
y = "SDAM",
color = "Model") +
theme_bw()
# Plot ROC
predictions_solo = predictions_solo %>%
mutate(winner = ifelse(win_place_perc == 1, 1, 0))
cutoff = seq(0, 1, by = 0.01)
sens_lm = sapply(cutoff, function(x) sum(predictions_solo$lm_pred > x & predictions_solo$winner)/sum(predictions_solo$winner))
spec_lm = sapply(cutoff, function(x) sum(predictions_solo$lm_pred <= x & !predictions_solo$winner)/sum(!predictions_solo$winner))
sens_lasso = sapply(cutoff, function(x) sum(predictions_solo$lasso_pred > x & predictions_solo$winner)/sum(predictions_solo$winner))
spec_lasso = sapply(cutoff, function(x) sum(predictions_solo$lasso_pred <= x & !predictions_solo$winner)/sum(!predictions_solo$winner))
sens_rf = sapply(cutoff, function(x) sum(predictions_solo$rf_pred > x & predictions_solo$winner)/sum(predictions_solo$winner))
spec_rf = sapply(cutoff, function(x) sum(predictions_solo$rf_pred <= x & !predictions_solo$winner)/sum(!predictions_solo$winner))
data.frame(cutoff = rep(cutoff, 3),
sens = c(sens_lm, sens_lasso, sens_rf),
spec = c(spec_lm, spec_lasso, spec_rf),
model = rep(c("Linear", "Elastic Net", "Random Forest"), each = length(cutoff))) %>%
ggplot(aes(x = 1-spec, y = sens, color = model)) +
geom_point() +
geom_line() +
theme_bw() +
labs(title = "ROC Curve",
x = "False Positive Rate (1 - Specificity)",
y = "Sensitivity",
color = "Model")
From the above plots, we can see that random forest performs best on all three metrics. This is despite the restrictions we had to place in order to run the random forest model in a reasonable amount of time. Its MAE is 0.058, which means that on average, the predicted win place percentage is 0.058 off from the true win place percentage. Its SDAM is consistently higher than the SDAM for linear regression or elastic net regression. For example, its SDAM(5) is 0.586, which means that for 58.6% of observations in our validation set, the predicted value is within 5% of the true win place percentage. Lastly, the area under its ROC curve is the largest, which indicates that it performs best at classifying who the winners are (sensitivity = 0.96, specificity = 0.82).
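The area under the ROC curve can be approximated from the sensitivity and specificity vectors computed above by trapezoidal integration (a sketch):

```r
# Trapezoidal approximation of the AUC for the random forest ROC curve
fpr <- 1 - spec_rf
ord <- order(fpr)
auc_rf <- sum(diff(fpr[ord]) *
              (head(sens_rf[ord], -1) + tail(sens_rf[ord], -1)) / 2)
auc_rf
```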
As we have found that the random forest model has the highest accuracy on the validation set, we validate our results on an independent test set.
pred_test = data.frame(rf_test_pred = predict(rf_model, test_solo)) %>%
mutate(win_place_perc = test_solo$win_place_perc,
rf_test_ae = abs(rf_test_pred - win_place_perc),
rf_test_correct = rf_test_ae <= 0.05,
winner = ifelse(win_place_perc == 1, 1, 0))
# Plot SDAM
cutoff = seq(0, 0.5, by = 0.01)
prop_test = sapply(cutoff, function(x) mean(pred_test$rf_test_ae < x))
data.frame(cutoff = rep(cutoff, 2),
prop = c(prop_rf, prop_test),
set = rep(c("Validation", "Test"), each = length(cutoff))) %>%
ggplot(aes(x = cutoff, y = prop, color = set)) +
geom_point() +
geom_line() +
geom_vline(xintercept = 0.05, linetype = "longdash", color = "slategray") +
annotate(geom = "text", x = 0.05, label="SDAM(0.05)", y = 0.8, colour = "slategray", angle = 90, vjust = -1.1) +
theme_bw() +
labs(title = "Comparison of SDAM",
subtitle = "Proportion of Predictions Within x% of Actual Win Place Percentage",
x = "|Predicted - Actual Win Place Percentage|",
y = "SDAM",
color = "Dataset")
# Plot ROC
cutoff = seq(0, 1, by = 0.01)
sens_test_rf = sapply(cutoff, function(x) sum(pred_test$rf_test_pred > x & pred_test$winner)/sum(pred_test$winner))
spec_test_rf = sapply(cutoff, function(x) sum(pred_test$rf_test_pred <= x & !pred_test$winner)/sum(!pred_test$winner))
data.frame(cutoff = rep(cutoff, 2),
sens = c(sens_rf, sens_test_rf),
spec = c(spec_rf, spec_test_rf),
set = rep(c("Validation", "Test"), each = length(cutoff))) %>%
ggplot(aes(x = 1-spec, y = sens, color = set)) +
geom_point() +
geom_line() +
theme_bw() +
labs(title = "ROC Curve",
x = "False Positive Rate (1 - Specificity)",
y = "Sensitivity",
color = "Dataset")
pred_test_metrics = pred_test %>%
summarize(rf_test_mae = mean(rf_test_ae),
rf_test_sdam = mean(rf_test_correct))
On the test set, random forest has an MAE of 0.047 and an SDAM(5) of 0.653, similar to what we saw in the validation set. We can see in the above plots of SDAM and ROC that the random forest model performs about as well on the test set as on the validation set.
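For reference, the SDAM metric used throughout can be written as a one-line helper. This is a sketch: `sdam` is a hypothetical function name, not part of the original code, but it matches the definition used above (the proportion of predictions whose absolute error is within x of the truth).

```r
# Hypothetical helper (not in the original code): SDAM(x) is the
# proportion of predictions within x of the true win place percentage.
sdam <- function(pred, actual, x = 0.05) {
  mean(abs(pred - actual) <= x)
}

# e.g. sdam(pred_test$rf_test_pred, pred_test$win_place_perc)
```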
Best-performing model: Random forest
We used three different models to predict the win place percentage: linear regression, elastic net regression, and random forest. Using mean absolute error (MAE), our self-defined accuracy metric (SDAM) quantifying the proportion of predictions that fall within x% of the true win place percentage, and the ROC curve, we found that the random forest model performed best on the validation set. On an independent test set, it achieved an MAE of 0.047 and an SDAM(5) of 0.653, along with a sensitivity of 0.88 and a specificity of 0.86, meaning that it correctly classifies 88% of winners and 86% of losers.
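The sensitivity and specificity quoted here correspond to a single point on the ROC curve; at any chosen cutoff they can be computed directly from the test-set predictions. This is a sketch: `classify_metrics` is a hypothetical helper and the cutoff value is illustrative, not the one used in the original analysis.

```r
# Hypothetical helper (not in the original code): sensitivity and
# specificity of the winner classification at a single cutoff.
classify_metrics <- function(pred, winner, cutoff) {
  pred_winner <- pred > cutoff
  c(sensitivity = sum(pred_winner & winner) / sum(winner),
    specificity = sum(!pred_winner & !winner) / sum(!winner))
}

# e.g. classify_metrics(pred_test$rf_test_pred, pred_test$winner == 1, 0.9)
```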
Higher predictive accuracy for top players
With the random forest model in particular, we can see from its plot of predicted vs. actual values that our predictions are more accurate for players with higher placements. This makes sense: the higher a player places, the greater the demands on skill, and therefore the more signal in the data. In other words, the difference in skill between a player who places 60th and one who places 90th out of 100 is likely much smaller than the difference between a player who places 1st and one who places 30th. As a result, we can distinguish between placements more accurately near the top.
As casual gamers and biostatistics students, we are interested in the features and strategies that characterize top-ranking PUBG players. To answer these questions, we looked at data for solo players in PUBG. The data consisted of 16 features describing player kills, damage, items used and acquired, and match attributes. In our exploratory data analysis, we noted that the more kills a player had, the more boosts they used, and the more weapons they acquired, the higher their final placement percentile. The importance of kills and boosts in ranking players was further reinforced by our principal components analysis and our prediction model for players' final placement percentile.
To model player percentile finish, we used a simple linear model with stepwise model selection using AIC, elastic net regression, and random forest. Of the three models, random forest performed best when the three were compared using MAE, SDAM, and the ROC curve: on average, the predicted win place percentage is 0.058 off from the true win place percentage, and 58.6% of predictions in the validation set were within 5% of the true win place percentage. The random forest model performed similarly well on the test and validation sets.
Though the nature of the data prevented us from using clustering to identify groups of individuals with similar game strategies, our analysis indicates that aggressive players, those with more kills relative to other players, perform better on average.
Our data analysis has a number of limitations. First, for computational purposes, we only used around 20% of the solo-player data; in the future, we could repeat the analysis with a larger dataset. Second, we only examined solo-player data, so our conclusions are not generalizable to group gameplay, which is complicated by aspects of teamwork such as revival and positioning. More work would be needed to extend our analysis to group play. Finally, to provide a more complete picture of the strategies behind success, we would need longitudinal data on key gameplay events such as kills, deaths, circle closings, and drop locations.